Tree-based methods stratify or segment the predictor space into a number of simple regions [James]. As the splitting rules used to make these decision regions can be summarised in a tree structure, these approaches are called decision trees.
To make a prediction for an observation, we use the mean or mode of the training observations in the region to which it belongs.
Like SVMs, these can be used for both classification and regression.
The "palmer penguins" datasetGorman contains data for 344 penguins from 3 different species and from 3 islands in the Palmer Archipelago, Antarctica.
Artwork by @allison_horst
Image(os.path.join(image_path, "lter_penguins.png"))
display(penguins.head())
penguins.info()
|   | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|---|---------|-----------|----------------|---------------|-------------------|-------------|--------|
| 0 | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | Male   |
| 1 | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | Female |
| 2 | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | NaN    |
| 3 | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    |
| 4 | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | Female |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
sns.pairplot(penguins, hue="species", markers=shape_dict, palette=col_dict)
plt.show()
A decision tree breaks the data down by asking a series of questions, with the aim of grouping samples of the same class together.
Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to produce binary trees, meaning nodes always have two children.
Terminology [James]
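The fitted classifier DT and the plotting helper regions_tree used in the next cell come from notebook cells not shown here. Below is a minimal sketch of how such a CART classifier might be fit, assuming penguins_bin is the Adelie/Gentoo subset of the data used later in this section; the column choice and tree depth are guesses based on the variable names and are purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Assumed subset: flipper length and bill length for Adelie vs Gentoo only
flbl = penguins_bin[["flipper_length_mm", "bill_length_mm", "species"]].dropna()
Xflbl = flbl.drop("species", axis=1).values
yflbl = flbl["species"].replace({"Adelie": 0, "Gentoo": 1}).values

# Shallow CART classifier; max_depth=2 is an illustrative choice
DT = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
DT.fit(Xflbl, yflbl)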
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels)
The algorithm starts at the tree root and splits the data on the feature that gives the largest information gain. This splitting procedure repeats until all the samples within a given node belong to the same class [Raschka], the maximum depth of the tree is reached, or a split cannot be found that reduces impurity [Géron].
Splitting using information gain relies on calculating the difference between the impurity of a parent node and the (weighted) sum of the impurities of its child nodes; information gain is high when the impurity of the child nodes is low.
Three impurity measures that are commonly used in binary decision trees are the classification error, Gini impurity, and entropy [1].
This is simply the fraction of the training observations in a region that do not belong to the most common class:
$E = 1 - \max_k(\hat p_{mk})$.
$\hat p_{mk}$ here is the proportion of training observations in the $m$th region that are from the $k$th class.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
Measures the total variance across $K$ classes,
$G = \sum^K_{k=1}\hat p_{mk}(1-\hat p_{mk})$.
It is a measure of node "purity", as a small value indicates a node contains predominantly observations from a single class.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels, impurity = True)
Entropy is an alternative measurement, given by
$D = -\sum^K_{k=1}\hat p_{mk}\log \hat p_{mk}$.
Entropy will take on a value near 0 if the $\hat p_{mk}$'s are all near 0 or 1, and will therefore be small if the $m$th node is pure.
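As a quick numerical check of the three impurity measures, the sketch below computes them for a node's class proportions. These helper functions are ours, not part of the notebook; a base-2 logarithm is used for entropy, which only rescales the measure.

import numpy as np

def classification_error(p):
    # E = 1 - max_k(p_mk)
    return 1 - np.max(p)

def gini(p):
    # G = sum_k p_mk * (1 - p_mk)
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    # D = -sum_k p_mk * log(p_mk), ignoring zero proportions
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A pure node scores 0 on every measure; an evenly mixed node scores highest
print(classification_error([1.0, 0.0]), gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0 -0.0
print(classification_error([0.5, 0.5]), gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 0.5 1.0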
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels, impurity = True)
[Insert about pruning]
Do not use label encoding with DecisionTreeClassifier() if your categorical data is not ordinal; you will end up with splits that do not make sense, as the data will be treated as numeric [Web].
Using a OneHotEncoder is currently the only valid approach, allowing arbitrary splits that do not depend on label ordering, but it is computationally expensive. It can also degrade the performance of decision trees, as it leads to sparse features, which can distort feature importance [Web].
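A minimal sketch of one-hot encoding the nominal penguin columns before fitting a tree; pd.get_dummies is used here for brevity (scikit-learn's OneHotEncoder achieves the same thing), and the feature choice is purely illustrative.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = penguins.dropna()
# Nominal columns become 0/1 indicator columns (e.g. island_Torgersen),
# so splits no longer depend on an arbitrary integer ordering
X_cat = pd.get_dummies(df[["island", "sex"]])
X_ohe = pd.concat([df[["flipper_length_mm", "body_mass_g"]], X_cat], axis=1)
y_ohe = df["species"]

DecisionTreeClassifier(random_state=42).fit(X_ohe, y_ohe)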
TODO
Web. https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree
- Trees are very easy to explain.
- Decision trees may more closely mirror human decision-making than the regression and classification approaches previously discussed.
- Trees can be displayed graphically, and are easily interpreted.
- Trees can easily handle qualitative predictors without the need to create dummy variables.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
Decision trees allow us to assess the importance of each feature for classifying the data. The importance (or Gini importance) of a feature is the normalized total reduction of the criterion (e.g. Gini) brought by that feature.
tree_feat_import(DT, X, y, feat_names,
'Feature Importances for Classifying Adelie and Gentoo Penguins')
bill_length_mm: 0.015
bill_depth_mm: 0.029
flipper_length_mm: 0.956
body_mass_g: 0.000
total: 1.0
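The helper tree_feat_import above is defined elsewhere in the notebook; a minimal equivalent using the fitted tree's feature_importances_ attribute (the printing format is ours):

# Gini importance of each feature, normalised to sum to 1
for name, imp in zip(feat_names, DT.feature_importances_):
    print(f"{name}: {imp:.3f}")
print("total:", DT.feature_importances_.sum())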
fig, axes = plt.subplots(figsize=(10,5))
with plt.style.context("classic"):
tree.plot_tree(DT,
feature_names=penguins_bin.drop("species",axis=1).columns,
class_names=['Adelie', 'Gentoo'],
filled = True)
plt.show()
X = penguins_cont.drop("species",axis=1).values
y = penguins_cont[["species"]].replace({'Adelie': 0, 'Gentoo': 1, "Chinstrap":2}).values.flatten()
DT = DecisionTreeClassifier(criterion='gini',
max_depth = None,
random_state=42)
DT.fit(X, y)
feat_names = penguins_cont.drop("species",axis=1).columns
fig, axes = plt.subplots(figsize=(10,5))
with plt.style.context("classic"):
tree.plot_tree(DT,
feature_names=feat_names,
class_names=['Adelie', 'Gentoo', 'Chinstrap'],
filled = True)
plt.show()
print("['Adelie', 'Gentoo', 'Chinstrap']")
['Adelie', 'Gentoo', 'Chinstrap']
- Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
- A small change in the data can cause a large change in the final estimated tree.
However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved. We introduce these concepts in the next section.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
As can be seen from the decision boundary, a decision tree produces quite boxy regions. Furthermore, the decision boundary the model produces is affected by the rotation of the data (as decision trees only create axis-aligned straight lines).
regions_tree(Xblbd, DT, blbd.columns[:-1], ['Adelie', 'Gentoo'], [[37, 22],[51.5, 22]], r_labels, tp_labels)
regions_tree(Xblbd, DT2, blbd.columns[:-1], ['Adelie', 'Gentoo'], [[57.5, 19.5],[57.5, 14.5]])
Which approach is better really depends on the relationship between the features and the response.
If there is a highly non-linear and complex relationship between the features and the response then decision trees may outperform classical approaches. However if the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure.
The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
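As a hedged illustration (not from the notebook), the relative performance could be checked with cross_val_score, comparing the tree against a simple linear classifier on the same features X and y prepared earlier:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validated accuracy for a tree vs. a (scaled) logistic regression
models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")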
Image(os.path.join("fig_8-7.png"))
A limit on nodes, or tree depth, is often set to avoid overfitting due to a deep tree [2].
DT = DecisionTreeClassifier(criterion='gini',
max_depth = None,
random_state=42)
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'])
Averaging methods build several separate estimators and then, as their name suggests, average their predictions.
By reducing the variance, these tend to perform better than any single base estimator [1].
Averaging methods generally work best when the predictors are as independent as possible, so one way of achieving this is to use diverse classifiers. This increases the chance that they each make different types of errors, which in combination improves the overall accuracy [3].
A bagging classifier is an ensemble of base classifiers, each fit on random subsets of a dataset. Their predictions are then pooled or aggregated to form a final prediction. This reduces the variance of an estimator, so can be a simple way to reduce overfitting and increase prediction accuracy [1].
This is because the variance of the mean $\bar Z$ of $n$ independent observations, $Z_1,...,Z_n$, each with variance $\sigma^2$, is given by $\sigma^2/n$; averaging a set of observations therefore reduces variance. We can use this by taking $B$ separate training sets from the population, building a separate prediction model on each training set, $\hat f^1(x),\hat f^2(x),...,\hat f^B(x)$, and averaging the resulting predictions [2]:
$\hat f_{avg}(x)=\frac{1}{B}\sum_{b=1}^B\hat f^b(x)$.
However, we generally do not have access to multiple training sets, so instead we can bootstrap by taking repeated samples from the (single) training data set to create $B$ bootstrapped training data sets. We then train our method on the $b$th bootstrapped training set to get $\hat f^{*b}(x)$, and finally average all the predictions to obtain [2]:
$\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat f^{*b}(x)$.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
Specifically, bagging refers to sampling with replacement [2], while sampling without replacement is called pasting [3].
Pasting is designed for cases where the training dataset does not fit into memory, by using sample sizes smaller than the training set [3].
Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows instances to be sampled several times for the same predictor [4].
In practice, bagging tends to work best with complex models [1]; although bagging is useful for regression methods generally, it is particularly useful for decision trees [2].
Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure.
These trees are grown deep, and are not pruned so each individual tree has high variance, but low bias. Averaging these $B$ trees reduces the variance.
To apply bagging to regression trees, we simply construct $B$ regression trees using $B$ bootstrapped training sets and average the resulting predictions. For classification trees, we instead record the class predicted by each of the $B$ trees and take a majority vote.
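The bagged ensemble bag used later in this section comes from a notebook cell not shown; a minimal sketch of how it might be built with scikit-learn's BaggingClassifier (the hyperparameters are illustrative assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree sees a bootstrap sample of the training data (bootstrap=True,
# i.e. sampling with replacement); bootstrap=False would give pasting instead.
# Note: `estimator` is named `base_estimator` in older scikit-learn versions.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
bag.fit(X, y)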
Using a bagged model means we can get a validation/test error without using cross-validation.
On average, each bagged tree uses around two-thirds of the observations [1][REF]; we can therefore predict these out-of-bag (OOB) observations using only the trees that were not fit on them.
With a sufficient number of bagged trees, the OOB error is virtually equivalent to leave-one-out cross-validation error, which is convenient when performing bagging on large data sets for which cross-validation would be computationally onerous [1].
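A minimal sketch of the OOB estimate with the same (assumed) setup, simply by passing oob_score=True:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each observation is scored only by the trees whose bootstrap sample did not
# contain it; oob_score_ then approximates the test accuracy.
bag_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    bootstrap=True,   # OOB estimation requires bootstrap sampling
    oob_score=True,
    random_state=42,
)
bag_oob.fit(X, y)
print("OOB accuracy:", bag_oob.oob_score_)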
Bagging improves prediction accuracy at the expense of interpretability.
However, we can use the feature importance method discussed previously to get an overall summary of the importance of each predictor across trees.
# get the importances for the features
importances = np.mean([
tree.feature_importances_ for tree in bag.estimators_
], axis=0)
importances_sd = np.std([
tree.feature_importances_ for tree in bag.estimators_
], axis=0)
feat_names = penguins_cont.drop("species",axis=1).columns
importances_series = pd.Series(importances,index=feat_names).sort_values(ascending = False)
# plot the important features
importances_series.plot.bar(legend =False, grid=False, yerr=importances_sd)
plt.title('Feature Importances')
plt.xticks(rotation=45,ha='right')
plt.tight_layout()
#plt.savefig('forest_importances.png', dpi=300)
plt.show()
Random forests are essentially bagged tree classifiers, but decorrelate the trees by using a random sample of features each time a split in a tree is considered.
The random forest algorithm can therefore be summarized in four steps [1]:

1. Draw a random bootstrap sample of size $n$.
2. Grow a decision tree from the bootstrap sample. At each node:
   a. Randomly select $d$ features without replacement (typically the square root of the total number of predictors).
   b. Split the node using the feature that provides the best split according to the objective function.
3. Repeat steps 1 and 2 $k$ times.
4. Aggregate the prediction by each tree to assign the class label by majority vote.
By not allowing the model to use the majority of the available predictors at each split, we ensure the bagged trees look different from each other. If there is a particularly strong set of predictors in the data, then without randomly selecting features the bagged trees will look quite similar to each other and their predictions will be highly correlated. Averaging highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities [2].
X = penguins_cont.drop("species",axis=1).values
y = penguins_cont[["species"]].replace({'Adelie': 0, 'Gentoo': 1, "Chinstrap":2}).values.flatten()
feat_names = penguins_cont.drop("species",axis=1).columns
class_names=['Adelie', 'Gentoo', 'Chinstrap']
RF = RandomForestClassifier(criterion='gini',
n_estimators=100,
max_samples=0.8,
max_features = 2,
random_state=42,
n_jobs=-1)
RF.fit(X, y)
plot_trees = 3
fig, axes = plt.subplots(ncols=plot_trees, figsize=(10,5))
with plt.style.context("classic"):
for i in range(plot_trees):
plt.sca(axes[i])
plot_tree(RF.estimators_[i],
feature_names=feat_names,
class_names=class_names,
filled = True)
plt.show()
Let's look at how a decision boundary created by a bagged tree could generalise better than a single tree.
TODO
tvf_regions()
As averaging methods work best when the predictors are as independent as possible [2], we may specifically want the trees in our ensemble to be more independent.
An extra-tree (extremely randomized tree) is similar to a decision tree classifier, except that it is more randomized and thus tends to produce more complex trees.
Instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule.
When used in an ensemble, this usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias [2].
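The fitted ET plotted below is created in a cell not shown; a minimal sketch of how a single extremely randomised tree might be fit on the same X and y (scikit-learn's ExtraTreeClassifier; in practice these are usually used inside the ExtraTreesClassifier ensemble):

from sklearn.tree import ExtraTreeClassifier

# Candidate thresholds are drawn at random rather than searched exhaustively,
# so an individual tree tends to grow deeper and more complex.
ET = ExtraTreeClassifier(max_depth=None, random_state=42)
ET.fit(X, y)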
As we can see from how just one of these trees looks, this is a very complex model!
fig, axes = plt.subplots(figsize=(10, 5))
with plt.style.context("classic"):
tree.plot_tree(ET,
feature_names=penguins_cont.drop("species",axis=1).columns,
class_names=['Adelie', 'Gentoo', 'Chinstrap'],
filled = True)
plt.show()